Fix ESP32 RSA watchdog timeout during key generation#3018
Fix ESP32 RSA watchdog timeout during key generation#3018valientegaston wants to merge 6 commits intotoitlang:masterfrom
Conversation
|
I also noticed that CONFIG_MBEDTLS_HARDWARE_MPI is explicitly set to n across all ESP32 toolchain configurations (esp32, esp32s2, esp32s3, esp32c3, esp32c6). Was this intentional? I understand it wouldn't help with key generation (as the prime generation and Miller-Rabin tests run in software regardless), but it could potentially speed up everyday RSA operations like signing, verification, and encryption/decryption. |
|
Primitives shouldn't be blocking. We want to run other tasks when waiting for a primitive. As such, the scheduler revealed a real issue.
I'm guessing we set the MBEDTLS_HARDWARE_MPI to 'n' to reduce the image size. We can definitely enable it again. |
… MPI support across ESP32 targets.
|
Hi! I have implemented the RSA key generation following your suggestions: Non-blocking: The primitives are now non-blocking. The key generation runs in a separate thread, and the Toit Task is suspended while waiting for the resource event. RsaGeneration Resource: I created a new RsaGeneration resource (contained in StructTag as RsaGenerationResourceTag and RsaGenerationResourceGroupTag) to manage the lifecycle of the asynchronous task. Architecture: I followed the approach in async_posix.cc, using the AsyncEventThread to coordinate the background generation and the completion event. Flags & Configuration: The new resource and primitives are correctly guarded by the CONFIG_TOIT_CRYPTO_EXTRA sdkconfig flag as you proposed. I've also updated the sdkconfig.defaults for ESP32 targets to enable this by default. Binary Compatibility: I made sure to append the new tags at the end of the StructTag enum to avoid any issues with existing bootstrap snapshots. Please let me know if there's any part of the implementation you'd like me to refine further. Thanks! |
- Add RsaGenerationResource_ class with finalizer/close pattern to ensure proper cleanup of resource group and state on exceptions (e.g. with-timeout). - Add rsa-generate-close primitive to tear down the resource group. - Fix dangling proxy in rsa_generate_finish error paths by adding missing clear_external_address calls. - Add no-op Thread::cancel() for ESP32 (required since async_posix is now compiled for ESP32).
|
Opened a fix PR on your fork: valientegaston#1 It addresses a few resource cleanup issues:
|
Fix RSA resource cleanup and ESP32 Thread::cancel
| free(prv); free(pub); | ||
| } else { | ||
| // mbedtls writes from the end. Move to start. | ||
| unsigned char* prv_start = (unsigned char*)malloc(prv_len); |
There was a problem hiding this comment.
If we don't have enough memory here, we would have to throw away everything we did so far and start again.
Also: I'm not even sure things would behave correctly: By marking the error as MBEDTLS_ERR_PK_ALLOC_FAILED the rsa_generate_finish would return a "malloc failed". This would indicate to the system that the primitive ran out of memory and that we should do a GC, then try again. However, the rsa_generate_finish would just fail now as the resource_proxy just has been cleared.
-> We need to allocate the memory in the main-thread (that has called the primitive) and not inside the event-thread. Unfortunately, this means that we need to know the prv_len und pub_len already before we actually call mbedtls_pk_write_key_der.
That said: it looks to me like a resize is the better option anyway. The prv_len/pub_len just seem to say how much of the given buffer was written too.
There was a problem hiding this comment.
Replaced the two extra malloc calls with memmove on the same buffer to shift the data to the start, followed by realloc to shrink it to the exact size. If realloc fails it returns null without altering the original block, so the original pointer is preserved and passed to set_results correctly — no memory leak or dangling pointer.
There was a problem hiding this comment.
The same still applies to the 'malloc' in line 986/987. (the ret goes to the rsa_generate_finish, leading to an unrecoverable OOM). The mallocs need to be done in the main thread. Here we just use constants (RSA_PRV_DER_MAX_BYTES), so that's relatively easy to fix.
| bool is_locked() const { return OS::is_locked(mutex_); } | ||
| bool is_boot_process(Process* process) const { return boot_process_ == process; } | ||
|
|
||
| Mutex* mutex() const { return mutex_; } |
There was a problem hiding this comment.
Necessary?
If yes, we should try to find a better approach than to make the scheduler's mutex public.
There was a problem hiding this comment.
No longer necessary. In the previous approach, the rsa_rng callback needed to take the scheduler mutex to update the process run_timestamp. With the new async design, the Toit process is suspended during key generation, so there's no need to update the timestamp from the RNG callback at all. The mutex() accessor has been removed.
| CONFIG_MBEDTLS_X509_TRUSTED_CERT_CALLBACK=y | ||
| CONFIG_MBEDTLS_ECP_RESTARTABLE=y | ||
| CONFIG_MBEDTLS_HARDWARE_MPI=n | ||
| CONFIG_MBEDTLS_HARDWARE_MPI=y |
There was a problem hiding this comment.
Is there any downside to enabling MPI?
For some esp32 variants we were fighting the partition sizes already. So if this increases the image size, we need to make sure we aren't too big now.
There was a problem hiding this comment.
I measured the binary size impact on ESP32:
- Without
HARDWARE_MPI: 0x1561f0 bytes (18% free of 0x1a0000 partition) - With
HARDWARE_MPI: 0x156390 bytes (18% free of 0x1a0000 partition)
The difference is only ~416 bytes.
I only tested on ESP32 so far — still checking the other variants.
There was a problem hiding this comment.
ESP32S3:
Without HARDWARE_MPI: 0x14d5c0 bytes (20% free)
With HARDWARE_MPI: 0x14d840 bytes (20% free)
Difference: ~640 bytes
Still checking the remaining variants (esp32c3, esp32c6, esp32s2).
There was a problem hiding this comment.
ESP32S2:
Without HARDWARE_MPI: 0x116a50 bytes (33% free)
With HARDWARE_MPI: 0x116ce0 bytes (33% free)
Difference: ~656 bytes
There was a problem hiding this comment.
ESP32C3:
Without HARDWARE_MPI: 0x163e00 bytes (14% free)
With HARDWARE_MPI: 0x164560 bytes (14% free)
Difference: ~1888 bytes
ESP32C6:
Without HARDWARE_MPI: 0x1818c0 bytes (11% free)
With HARDWARE_MPI: 0x182020 bytes (11% free)
Difference: ~1888 bytes
|
@valientegaston do you think you can address the feedback? |
|
Hi! Yes, I'll be addressing the feedback in the coming days. We've been quite busy lately but I'll make sure to go through all the comments and make the necessary changes. Thanks for your patience! |
…d remove unused scheduler mutex accessor
floitsch
left a comment
There was a problem hiding this comment.
Still some subtle issues.
Not completely sure what the best way for the resource handling is.
Maybe one way is to keep track of whether the parallel thread is running and store that in the resource. When the resource gets the signal to destroy itself, it could then delay that action until the thread returns. Alternatively (if possible) killing the thread first, would also work.
We do something slightly similar in the BLE code.
| } | ||
|
|
||
| if (ret == 0) { | ||
| unsigned char* prv = (unsigned char*)malloc(RSA_PRV_DER_MAX_BYTES); |
There was a problem hiding this comment.
When do we free the memory again? Shouldn't that happen in rsa_generate_finish?
But that's more complicated:
- what if nothing ever calls 'rsa_generate_finish'? This means that the memory needs to be attached to the resource.
- However, that immediately reveals another race condition (which was already present): what if the program stops while we are running in parallel. The
resresource could already be freed at that point.
| free(prv); free(pub); | ||
| } else { | ||
| // mbedtls writes from the end. Move to start. | ||
| unsigned char* prv_start = (unsigned char*)malloc(prv_len); |
There was a problem hiding this comment.
The same still applies to the 'malloc' in line 986/987. (the ret goes to the rsa_generate_finish, leading to an unrecoverable OOM). The mallocs need to be done in the main thread. Here we just use constants (RSA_PRV_DER_MAX_BYTES), so that's relatively easy to fix.
| memmove(prv, prv + RSA_PRV_DER_MAX_BYTES - prv_len, prv_len); | ||
| memmove(pub, pub + RSA_PUB_DER_MAX_BYTES - pub_len, pub_len); | ||
|
|
||
| unsigned char* prv_resized = (unsigned char*)realloc(prv, prv_len); |
There was a problem hiding this comment.
Maybe add the following information:
RSA_PRV_DER_MAX_BYTES is typically ~550 bytes.
Typically, keys only need ~300 bytes.
As such the realloc is useful.
However, if it fails, it's not the end of the world either.
| } | ||
|
|
||
| if (ret == 0) { | ||
| unsigned char* prv = (unsigned char*)malloc(RSA_PRV_DER_MAX_BYTES); |
There was a problem hiding this comment.
Use unvoid_cast instead of C-style casts. Here and in all the other places.
| } | ||
|
|
||
| ByteArray* prv_der = process->allocate_byte_array(resource->prv_len()); | ||
| ByteArray* pub_der = process->allocate_byte_array(resource->pub_len()); |
There was a problem hiding this comment.
Another option is to use an external ByteArray and use the existing buffers. That would decrease the peak amount of memory.
I think for 250+ bytes it's ok to have an external byte-array.
However, at that point I would do another realloc, just in case the one in the other thread didn't succeed. If it still doesn't succeed, you can do a MALLOC_FAILED.
| resource->resource_group()->unregister_resource(resource); | ||
| resource_proxy->clear_external_address(); |
There was a problem hiding this comment.
That's a bit too eager.
For OOMs the primitive will do a GC and then try again. If you destroy the resource the second attempt wouldn't work anymore.
Problem
On ESP32, generating an RSA 2048-bit key takes longer than 10 seconds, causing the watchdog timer to kill the process, mistakenly believing it is hung.
Root cause
The mbedtls RSA operations (rsa_generate, rsa_sign, rsa_encrypt, rsa_decrypt) were passing NULL as the context to the rsa_rng callback. The scheduler watchdog monitors the run_timestamp of each process and kills it if it hasn't been updated within the timeout window. Since the RSA key generation is a long-running blocking operation, the timestamp was never updated during its execution.
Fix
The rsa_rng callback is called many times by mbedtls during RSA operations — in particular during key generation, it is called repeatedly to generate prime candidates and random bases for Miller-Rabin primality tests. This makes it a natural and low-overhead place to reset the watchdog by updating the process run_timestamp.
The fix passes the Process* as the context to all rsa_rng calls, and inside rsa_rng the scheduler mutex is taken to safely update the timestamp.
Safety
We verified that primitives execute inside an Unlocker block in Scheduler::run_process, meaning the scheduler mutex is not held during RSA operations. Therefore, taking it inside rsa_rng is safe and will not cause a deadlock.